Abstract: Bounds on the entropy of patterns of sequences 
generated by independently identically distributed (i.i.d.) sources 
are derived. A pattern is a sequence of indices that contains all 
consecutive integer indices in increasing order of first occurrence. 
If the alphabet of a source that generated a sequence is unknown, 
the inevitable cost of coding the unknown alphabet symbols can 
be exploited to create the pattern of the sequence. This pattern 
can in turn be compressed by itself. The bounds derived here 
are functions of the i.i.d. source entropy, alphabet size, and letter 
probabilities. It is shown that for large alphabets, the pattern 
entropy must decrease from the i.i.d. one. The decrease is in many 
cases more significant than the universal coding redundancy 
bounds derived in prior works. The pattern entropy is confined 
between two bounds that depend on the arrangement of the letter 
probabilities in the probability space. For very large alphabets 
whose size may be greater than the coded pattern length, all 
low probability letters are packed into one symbol. The pattern 
entropy is upper and lower bounded in terms of the i.i.d. entropy 
of the new packed alphabet. Correction terms, which are usually 
negligible, are provided for both upper and lower bounds. 

I. Introduction 

Several recent works (see, e.g., [1], [3]-[4], [6], [8], [9]) 
have considered universal compression for patterns of inde- 
pendently identically distributed (i.i.d.) sequences. The pattern 
of a sequence is a sequence of pointers that point to the actual 
alphabet letters, where the alphabet letters are assigned indices 
in order of first occurrence. For example, the pattern of the 
sequence "lossless" is "12331433". A pattern sequence thus 
contains all positive integers from 1 up to a maximum value 
in increasing order of first occurrence, and is also independent 
of the alphabet of the actual data. Universal compression of 
patterns is interesting in applications that attempt to compress 
sequences generated by an initially unknown alphabet, such 
as a document in an unknown language. Utilization of the 
necessary coding of the unknown symbols can take place 
by ordering the symbols in their order of occurrence in 
the sequence, and then separately compressing the alphabet 
independent pattern of the sequence. 
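
The mapping from a sequence to its pattern can be sketched in a few lines. The function name `pattern` and the choice of returning a tuple of integers are assumptions made for illustration only; the example sequences are the ones used in this paper.

```python
def pattern(seq):
    """Return the pattern of seq as a tuple of 1-based indices.

    Each distinct symbol is replaced by the order in which it first
    occurred, so the result does not depend on the actual alphabet.
    """
    first_seen = {}  # symbol -> index in order of first occurrence
    out = []
    for symbol in seq:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        out.append(first_seen[symbol])
    return tuple(out)


if __name__ == "__main__":
    for s in ("lossless", "sellsoll", "12331433", "76887288"):
        print(s, "->", "".join(map(str, pattern(s))))
```

All four sequences map to the same pattern, which illustrates that a pattern carries no information about the identity of the letters.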

To the best of our knowledge, universal compression of 
patterns was first considered in [1], where it was proposed to 
compress sequences from a large known alphabet in which 
not all symbols are expected to occur by separating the 
representation of the occurring alphabet symbols from the 
pattern and compressing each separately. The paper considered 
compression of individual sequences. Later, patterns were 
rediscovered (and named) in a series of papers [3], [4] (and 
references therein) that considered universal compression for 
unknown alphabets and thoroughly studied the redundancy in 
universal coding of individual pattern sequences. These papers 
demonstrated that the individual sequence redundancy of pat- 
terns must decrease in universal compression compared to the 
redundancy obtained for simple universal compression of i.i.d. 
sequences. Furthermore, unlike the i.i.d. case, it was shown 
that the redundancy in universal pattern compression vanishes 
even if the alphabet is infinite. (This is, of course, related to 
the fact that we lose some information by coding the pattern 
instead of the actual sequence.) The universal average case 
was then studied in [6], [8], [9], where redundancy bounds for 
average case universal compression of patterns were derived. 

The universal description length of patterns, however, con- 
sists of the pattern entropy and the redundancy of universally 
coding the pattern. While most of the emphasis in prior work 
was on the latter, it is clear that a pattern is the result of data processing 
of the actual sequence, and thus its entropy (the first term) 
must decrease. Furthermore, in [9] (see also [6]), we derived 
sequential codes for compressing patterns and bounded their 
description length. It was shown that for sufficiently large 
alphabets this description length was significantly smaller than 
the i.i.d. source entropy. This points to the fact that not only 
is there an entropy decrease in patterns, but for large alphabets, 
this decrease is much more significant than the increase in 
description length due to the universal redundancy. Hence, 
to have better understanding even of universal compression 
of patterns, it is essential also to study the behavior of the 
pattern entropy. Pattern entropy is also important in learning 
applications. Consider all the new faces that a newborn sees. 
The newborn can identify these faces with the first time each 
was seen. There is no difference if it sees nurse A or nurse B 
(and never sees the other), as long as it is a nurse. The entropy 
of patterns can thus model the uncertainty of such learning 
processes. The exponent of the entropy gives an approximate 
count of the typical patterns one is likely to observe for the 
given source distribution. If the uncertainty goes to 0, we are 
likely to observe only one pattern. 

We first considered pattern entropy in [7], where we 
bounded the range of values within which the entropy of 
a pattern can be, depending on the specific distribution, as 
a function of the i.i.d. entropy. We showed that for larger 
alphabets, the pattern entropy must decrease with respect to 
(w.r.t.) the i.i.d. one. However, the results were limited to 
distributions that contain only letters with sufficiently large 
letter probability. An upper bound that extends the results to 
unbounded distributions was derived in [10]. Subsequent to 
our initial paper [7], pattern entropy was independently studied 
with different approaches and from a different view of the 
problem in [2] and [5], where the focus has been on limiting 
results for the entropy rate of patterns. 

In this paper, we continue and generalize the results in [7] 
and [10]. We derive general upper and lower bounds for the 
entropy of patterns generated by large and very large alphabets. 
The bounds are presented as functions of a related i.i.d. 
entropy, the alphabet size, and the alphabet letter probabilities. 
The related i.i.d. entropy is that of the i.i.d. source if no 
letters with very low probabilities exist. Otherwise, all the 
probabilities smaller than a threshold are packed into one 
symbol, and the i.i.d. entropy is that of the new alphabet. 
Since the detailed proofs of most of the bounds require lengthy 
rigorous analysis, we only include road maps of the proofs in 
this paper. The complete proofs are presented in [11]. 

The technique used to derive the bounds in this paper relies 
on partitioning the probability space into a grid of points. 
Between each two points, we obtain a bin. For a typical i.i.d. 
sequence of the source, each permutation of the sequence 
letters that only permutes among letters in the same bins, 
has almost the same probability as the typical sequence, and 
results in the same pattern. Such permutations exchange all 
occurrences of one letter with all occurrences of another. The 
probability of the pattern thus exceeds that of the i.i.d. 
sequence by a factor of roughly the number of such permutations. This, in turn, 
yields a decrease in the pattern entropy. This idea is used 
directly to derive some of the bounds, and is extended to 
include low probabilities to derive the more general bounds. To 
derive a general upper bound, we propose a low-complexity se- 
quential (non-universal) code for compressing patterns, which 
achieves the bound. The algorithm is, again, based on the idea 
of bins. The use of bins is not easy because the grids that 
determine the bins need to be wisely designed to efficiently 
utilize the probability space. In particular, the grid points 
are taken in increasing spacing. The reason is that for large 
probabilities, the decrease in probability assigned to a typical 
sequence is slower as we shift away from the true letter 
probability. 

The outline of the paper is as follows. In Section II, we 
define the notation. Section III reviews initial simple, easy to 
derive, bounds on the entropy, and motivates the remainder of 
the paper. Then, in Section IV, we derive the upper and lower 
bounds for pattern entropy of i.i.d. sources with sufficiently 
large probabilities, and show the range of values that the 
pattern entropy can take in this case, depending on the actual 
source distribution. Finally, Section V contains the derivations 
of more general upper and lower bounds that do not require 
a condition on the letter probabilities. 

II. Notation and Definitions 

Let $x^n = (x_1, x_2, \ldots, x_n)$ be a sequence of $n$ symbols over 
an alphabet $\Sigma$ of size $k$. The parameter $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)$ 
contains the probabilities of the alphabet letters. Since the 
order of these probabilities does not affect the pattern, we 
assume, without loss of generality, that $\theta_1 \le \theta_2 \le \cdots \le \theta_k$, 
and that $\Sigma = \{i, 1 \le i \le k\}$. In general, boldface letters will 
denote vectors, whose components will be denoted by their 
indices. Capital letters will denote random variables. 

The pattern of $x^n$ will be denoted by $\psi^n = \Psi(x^n)$. Different 
sequences can have the same pattern. For example, for the 
sequences $x^n =$ "lossless", $x^n =$ "sellsoll", $x^n =$ "12331433", 
and $x^n =$ "76887288", the pattern is $\Psi(x^n) =$ "12331433". 
Therefore, for given $\Sigma$ and $\boldsymbol{\theta}$, the probability of a pattern $\psi^n$ 
induced by an i.i.d. underlying probability is given by 

$$P_\theta(\psi^n) = \sum_{y^n : \Psi(y^n) = \psi^n} P_\theta(y^n). \tag{1}$$

The probability of $\Psi(x^n)$ can be expressed as in (1) by 
summing over all sequences that have the same pattern with 
a fixed parameter vector. However, we can also express it by 
fixing the actual sequence and summing over all permutations 
of occurring symbols of the parameter vector 

$$P_\theta[\Psi(x^n)] = \sum_{\sigma} P_{\theta(\sigma)}(x^n), \tag{2}$$

where the summation is over all permutation vectors $\sigma$ that differ 
from each other in the index of the probability parameter 
assigned to at least one occurring letter, and $\theta(\sigma_i)$ denotes the 
$i$th component of the permuted vector $\boldsymbol{\theta}$, permuted according 
to $\sigma$. For example, if $\boldsymbol{\theta} = (0.7, 0.1, 0.2)$ and $\sigma = (3, 1, 2)$, 
then $\boldsymbol{\theta}(\sigma) = (0.2, 0.7, 0.1)$ and $\theta(\sigma_2) = \theta_1 = 0.7$. 
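
The equivalence of the two expressions (1) and (2) can be checked by brute force on a toy source. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, letters are represented by the indices `0..k-1`, and the sum in (2) is implemented as a sum over all one-to-one assignments of the occurring letters to parameter indices, which is how the restriction to permutations that differ on at least one occurring letter is read here.

```python
from itertools import permutations, product
from math import prod


def pattern(seq):
    """Pattern of seq: indices in order of first occurrence (1-based)."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen) + 1) for s in seq)


def pattern_prob_by_sequences(theta, psi):
    """Eq. (1): sum P(y^n) over all sequences y^n whose pattern is psi."""
    k, n = len(theta), len(psi)
    return sum(
        prod(theta[y] for y in ys)
        for ys in product(range(k), repeat=n)
        if pattern(ys) == tuple(psi)
    )


def pattern_prob_by_permutations(theta, x):
    """Eq. (2): fix x^n and sum over assignments of parameters to its letters."""
    letters = list(dict.fromkeys(x))  # distinct letters, first-occurrence order
    total = 0.0
    for assignment in permutations(range(len(theta)), len(letters)):
        sigma = dict(zip(letters, assignment))
        total += prod(theta[sigma[s]] for s in x)
    return total


if __name__ == "__main__":
    theta = (0.7, 0.1, 0.2)
    print(pattern_prob_by_sequences(theta, (1, 2, 3, 1)))
    print(pattern_prob_by_permutations(theta, "abca"))
```

Both calls evaluate the probability of the pattern 1231 under the example parameter vector; the enumeration is exponential in $n$ and is meant for very small cases only.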

The entropy rate of an i.i.d. source will be denoted 
by $H_\theta(X)$. The sequence entropy for an i.i.d. source is 
$H_\theta(X^n) = nH_\theta(X)$. The pattern sequence entropy of order 
$n$ of a source is defined as 

$$H_\theta(\Psi^n) = -\sum_{\psi^n} P_\theta(\psi^n) \log P_\theta(\psi^n). \tag{3}$$

As described in Section I, we will grid the probability space 
in order to derive the bounds. Letters whose probabilities lie in 
the same bin between two adjacent grid points will be grouped 
together. We will use two different grids, as defined below, to 
derive the bounds. For an arbitrarily small $\varepsilon > 0$, let $\boldsymbol{\eta} = (\eta_0, \eta_1, \eta_2, \ldots, \eta_b, \ldots, \eta_B)$ be a grid of $B + 1$ points, where 
$\eta_0 = 0$, $\eta_1 = 1/n^{1+\varepsilon}$, and let 

$$\tau'_b = \sum_{j=1}^{b} \frac{2\left(j - \frac{1}{2}\right)}{n^{1+2\varepsilon}} = \frac{b^2}{n^{1+2\varepsilon}}. \tag{4}$$

Then, 

$$\eta_b = \tau'_{b + n^{3\varepsilon/2} - 2}, \quad \forall b \ge 2, \tag{5}$$

i.e., $\eta_2 = 1/n^{1-\varepsilon}$, and so on. Clearly, there are $B$ nonzero 
grid points, where $B$ is the rounded down integer of $\sqrt{n^{1+2\varepsilon}} - n^{3\varepsilon/2} + 2$. 
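
A minimal sketch of the grid construction follows. It assumes the reading $\eta_0 = 0$, $\eta_1 = 1/n^{1+\varepsilon}$, and $\eta_b = (b + n^{3\varepsilon/2} - 2)^2 / n^{1+2\varepsilon}$ for $b \ge 2$, with $B = \lfloor \sqrt{n^{1+2\varepsilon}} - n^{3\varepsilon/2} + 2 \rfloor$; the function name `build_eta_grid` is hypothetical, and the values of $n$ and $\varepsilon$ in the usage lines are arbitrary.

```python
import math


def build_eta_grid(n, eps):
    """Return the list [eta_0, eta_1, ..., eta_B] under the assumed reading.

    eta_0 = 0, eta_1 = n^-(1+eps), and for b >= 2
    eta_b = (b + n^(3*eps/2) - 2)^2 / n^(1+2*eps).
    """
    shift = n ** (1.5 * eps)      # n^(3*eps/2)
    scale = n ** (1 + 2 * eps)    # n^(1+2*eps)
    big_b = math.floor(math.sqrt(scale) - shift + 2)
    eta = [0.0, n ** -(1 + eps)]
    eta += [(b + shift - 2) ** 2 / scale for b in range(2, big_b + 1)]
    return eta


if __name__ == "__main__":
    n, eps = 10**6, 0.1
    eta = build_eta_grid(n, eps)
    print("grid points:", len(eta))
    print("eta_2 * n^(1-eps) =", eta[2] * n ** (1 - eps))
    print("largest grid point:", eta[-1])
```

Under this reading the spacing $\eta_{b+1} - \eta_b$ grows linearly in $b$ for $b \ge 2$, which matches the increasing spacing described in Section I.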

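We will use $k_b$ to denote the number of letters with $\theta_i \in (\eta_b, \eta_{b+1}]$. In particular, $k_0$ will denote the number of letters 
in $\Sigma$ with probability not greater than $1/n^{1+\varepsilon}$, $k_1$ the 
number of letters with probabilities in $(1/n^{1+\varepsilon}, 1/n^{1-\varepsilon}]$, and 
$k_{01}$ their sum. Let $\varphi_b$ be the total probability of letters in 
bin $b$ of grid $\boldsymbol{\eta}$. Of particular importance will be $\varphi_0$, $\varphi_1$, 
defined w.r.t. bins 0, 1, respectively, and $\varphi_{01} = \varphi_0 + \varphi_1$. 
We use $L$ and $L_b$ for the mean number of total letters, 
and letters from bin $b$, respectively, that occur in $X^n$, i.e., 
$L_b = \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} \left[1 - (1 - \theta_i)^n\right]$. It is easy to see that 

$$k_b - \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} e^{-n\theta_i} \le L_b \le k_b - \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} e^{-n(\theta_i + \theta_i^2)}, \tag{6}$$

where in the upper bound summation only $\theta_i \le 3/5$ are 
included. In particular, for bin 0, 

$$n\varphi_0 - \binom{n}{2} \sum_{i=1}^{k_0} \theta_i^2 \le L_0 \le n\varphi_0 - \binom{n}{2} \sum_{i=1}^{k_0} \theta_i^2 + \binom{n}{3} \sum_{i=1}^{k_0} \theta_i^3. \tag{7}$$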
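
The bounds on the mean number of occurring letters can be checked numerically. The sketch below assumes the reading of (6) as $k_b - \sum e^{-n\theta_i} \le L_b \le k_b - \sum e^{-n(\theta_i + \theta_i^2)}$ (the second sum restricted to $\theta_i \le 3/5$) and of (7) as the first terms of the inclusion-exclusion expansion of $1 - (1 - \theta_i)^n$. All function names and the numerical values are hypothetical.

```python
import math


def mean_distinct(thetas, n):
    """L_b = sum over the bin of 1 - (1 - theta)^n, computed stably (theta < 1)."""
    return sum(-math.expm1(n * math.log1p(-t)) for t in thetas)


def bounds_6(thetas, n):
    """Lower and upper bounds on L_b in the assumed form of (6)."""
    k_b = len(thetas)
    lower = k_b - sum(math.exp(-n * t) for t in thetas)
    # Only probabilities not exceeding 3/5 enter the upper-bound sum.
    upper = k_b - sum(math.exp(-n * (t + t * t)) for t in thetas if t <= 0.6)
    return lower, upper


def bounds_7(thetas, n):
    """Lower and upper bounds on L_0 in the assumed form of (7)."""
    phi = sum(thetas)
    s2 = sum(t ** 2 for t in thetas)
    s3 = sum(t ** 3 for t in thetas)
    lower = n * phi - math.comb(n, 2) * s2
    upper = lower + math.comb(n, 3) * s3
    return lower, upper


if __name__ == "__main__":
    n = 1000
    small = [1e-5] * 50           # stand-in for very low probability letters
    mixed = [0.01, 0.02, 0.3]     # stand-in for a bin of larger probabilities
    print(bounds_7(small, n), mean_distinct(small, n))
    print(bounds_6(mixed, n), mean_distinct(mixed, n))
```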

The points $\xi_b$, $b \ge 1$, in grid $\boldsymbol{\xi} = (\xi_0, \xi_1, \ldots, \xi_{B_\xi})$ are defined 
as in (4), but with $-\varepsilon$ replacing $2\varepsilon$, and also $\xi_0 = 0$. Here, 
we will use $\kappa_b$, $b \ge 1$, to denote the number of letters whose 
probabilities are in the three adjacent bins surrounding $b$, i.e., 
$\theta_i \in (\xi_{b-1}, \xi_{b+1}]$, with the exception of $\kappa_1$, which will only 
count the letters with probabilities in $(\xi_1, \xi_2]$. 

Using the definitions above, we can now define two i.i.d. entropy 
expressions, where some of the low probability symbols 
are packed into one symbol, 

$$H_\theta^{(01)}(X) = -\varphi_{01} \log \varphi_{01} - \sum_{i=k_{01}+1}^{k} \theta_i \log \theta_i, \tag{8}$$

$$H_\theta^{(0,1)}(X) = -\sum_{b=0}^{1} \varphi_b \log \varphi_b - \sum_{i=k_{01}+1}^{k} \theta_i \log \theta_i. \tag{9}$$
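
The two packed entropies can be computed directly from the letter probabilities. The sketch below assumes that bin 0 collects probabilities not greater than $1/n^{1+\varepsilon}$, bin 1 collects probabilities in $(1/n^{1+\varepsilon}, 1/n^{1-\varepsilon}]$, and logarithms are base 2; the function name `packed_entropies` and the example distribution are hypothetical.

```python
from math import log2


def packed_entropies(theta, n, eps):
    """Return (H^(01), H^(0,1)): bins 0 and 1 packed jointly or separately."""
    lo = n ** -(1 + eps)   # upper edge of bin 0
    hi = n ** -(1 - eps)   # upper edge of bin 1
    phi0 = sum(t for t in theta if t <= lo)
    phi1 = sum(t for t in theta if lo < t <= hi)
    # Letters above the threshold keep their individual entropy terms.
    large = -sum(t * log2(t) for t in theta if t > hi)

    def term(p):
        return -p * log2(p) if p > 0 else 0.0

    h_01 = term(phi0 + phi1) + large          # one super-symbol for both bins
    h_0_1 = term(phi0) + term(phi1) + large   # one super-symbol per bin
    return h_01, h_0_1


if __name__ == "__main__":
    theta = [0.005, 0.005, 0.01, 0.01, 0.47, 0.5]
    print(packed_entropies(theta, n=100, eps=0.1))
```

Since splitting a symbol cannot reduce entropy, the second value is never smaller than the first, and both are at most the unpacked i.i.d. entropy.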

III. Background and Simple Bounds 

It is clear that the pattern entropy satisfies the following 
bounds: 

Theorem 1: If $k \le n$, 

$$nH_\theta(X) - \log(k!) \le H_\theta(\Psi^n) \le nH_\theta(X). \tag{10}$$

Otherwise, 

$$nH_\theta(X) - \log \frac{k!}{(k-n)!} \le H_\theta(\Psi^n) \le nH_\theta(X). \tag{11}$$

The upper bound is trivial, and the lower bounds are proved in 
[11]. For $k = o(n)$, the simple bound in (10) already points to the 
fact that if the i.i.d. entropy rate of the source is not vanishing, 
the entropy rate of patterns is equal to the i.i.d. one. However, 
it is clear that for many sources the bounds above are not tight. 
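
For very small $k$ and $n$, the pattern entropy can be computed exactly by enumeration and compared against the simple bounds of the form $nH_\theta(X) - \log(k!) \le H_\theta(\Psi^n) \le nH_\theta(X)$ for $k \le n$. The following sketch does this in bits (base-2 logarithms, an assumption); the function names and the toy distribution are hypothetical.

```python
from itertools import product
from math import factorial, log2, prod


def pattern(seq):
    """Pattern of seq: indices in order of first occurrence (1-based)."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen) + 1) for s in seq)


def iid_entropy(theta):
    """Per-letter entropy H(X) of the i.i.d. source, in bits."""
    return -sum(t * log2(t) for t in theta if t > 0)


def pattern_entropy(theta, n):
    """Exact entropy of the length-n pattern, by enumerating all k^n sequences."""
    probs = {}
    for ys in product(range(len(theta)), repeat=n):
        psi = pattern(ys)
        probs[psi] = probs.get(psi, 0.0) + prod(theta[y] for y in ys)
    return -sum(p * log2(p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    theta, n = (0.7, 0.1, 0.2), 4
    upper = n * iid_entropy(theta)
    lower = upper - log2(factorial(len(theta)))
    print(lower, pattern_entropy(theta, n), upper)
```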

In [9], we derived a universal sequential algorithm for 
coding patterns. The bound on its description length provides a 
bound on the pattern entropy. In particular, if $k > e^{19/18} \cdot n^{1/3}$, 
where $k$ is the number of alphabet letters that occur in $x^n$ 
with probability at least $(1 - \varepsilon)$, it was shown that the pattern 
entropy must decrease from the i.i.d. one, and 

$$H_\theta(\Psi^n) \le nH_\theta(X) - (1 - \varepsilon)\,\frac{3}{2}\,k \log \frac{k}{e^{19/18}\, n^{1/3}}. \tag{12}$$

For sources with very high entropy, for example, $\theta_i = n^{-\alpha}$, $\forall i$, 
for some constant $\alpha > 1$, the bound increases with $n$ and 
becomes loose. The derivation in [9] can thus be used to 
replace the first term in the bound by $n \log n$. However, this 
still yields a very loose bound on the entropy. 

IV. Bounds for Small and Large Alphabets 

We now consider sources in which $\theta_1 > 1/n^{1-\varepsilon}$, i.e., $k_{01} = 0$. 
We present an upper bound and a lower bound for this case, 
and discuss the range of values the entropy can take, where 
for sufficiently large $k$, it must decrease from the i.i.d. one. 

A. An Upper Bound 

The following theorem upper bounds the pattern entropy. 

Theorem 2: Let $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, $1 \le i \le k$. Then, 

$$H_\theta(\Psi^n) \le nH_\theta(X) - (1 - \varepsilon) \sum_{b=2}^{B} \log(k_b!). \tag{13}$$

To prove Theorem 2, we lower bound the probability of 
patterns generated only from typical sequences $x^n$ by the 
sum of probabilities of all typical sequences that have this 
pattern. Using (2) and this idea, $P_\theta[\Psi(x^n)]$ is lower bounded 
by the partial sum of permutations $\sigma$ of $\boldsymbol{\theta}$ that contains 
only permutations for which, for every $i$ and every $b$, $\theta_i \in (\eta_b, \eta_{b+1}] \Rightarrow \theta(\sigma_i) \in (\eta_b, \eta_{b+1}]$. For all such permutations 
and a typical sequence $x^n$, the probability assigned to $x^n$ 
decreases at most negligibly w.r.t. the actual probability of 
$x^n$. Hence, for a typical $x^n$, 

$$\log P_\theta[\Psi(x^n)] \ge \log P_\theta(x^n) + \log M_\theta - o(k), \tag{14}$$

where $M_\theta$ is the number of such permutations $\sigma$. Computing 
$M_\theta$ and accounting for the probability of non-typical 
sequences yields the bound of (13). 
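
To give a feel for the size of the decrease, the sketch below evaluates an upper bound of the form $nH_\theta(X) - (1-\varepsilon)\sum_{b \ge 2} \log(k_b!)$ for a given distribution. It is an illustration under assumptions: the grid is taken as $\eta_b = (b + n^{3\varepsilon/2} - 2)^2/n^{1+2\varepsilon}$ for $b \ge 2$, bins are half-open intervals $(\eta_b, \eta_{b+1}]$, logarithms are base 2, and the names `eta_grid` and `theorem2_upper_bound` are hypothetical.

```python
import math
from bisect import bisect_left
from collections import Counter


def eta_grid(n, eps):
    """Grid eta_0..eta_B under the assumed reading of the grid definition."""
    shift = n ** (1.5 * eps)
    scale = n ** (1 + 2 * eps)
    big_b = math.floor(math.sqrt(scale) - shift + 2)
    eta = [0.0, n ** -(1 + eps)]
    eta += [(b + shift - 2) ** 2 / scale for b in range(2, big_b + 1)]
    return eta


def theorem2_upper_bound(theta, n, eps):
    """Evaluate n*H(X) - (1 - eps) * sum_{b>=2} log2(k_b!) in bits."""
    if min(theta) <= n ** -(1 - eps):
        raise ValueError("all letter probabilities must exceed 1/n^(1-eps)")
    eta = eta_grid(n, eps)
    # Bin b holds the letters with eta_b < theta_i <= eta_{b+1}.
    k_b = Counter(bisect_left(eta, t) - 1 for t in theta)
    h = -sum(t * math.log2(t) for t in theta)
    gain = sum(
        math.lgamma(count + 1) / math.log(2)   # log2(count!)
        for b, count in k_b.items()
        if b >= 2
    )
    return n * h - (1 - eps) * gain


if __name__ == "__main__":
    k, n, eps = 1000, 10**6, 0.1
    theta = [1.0 / k] * k
    print(n * math.log2(k), theorem2_upper_bound(theta, n, eps))
```

For the uniform example all letters fall in one bin, so the subtracted term is $(1-\varepsilon)\log(k!)$, the largest possible for that alphabet size.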

B. A Lower Bound 

The next theorem shows a bound of similar nature to the 
bound of Theorem 2. 

Theorem 3: Let $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, $1 \le i \le k$. Then, 

$$H_\theta(\Psi^n) \ge nH_\theta(X) - \sum_{b=1}^{B} \log(\kappa_b!) - o(1). \tag{15}$$

To prove Theorem 3, we first define a typical pattern $\psi^n$ as 
one that is the pattern of at least one typical $x^n$. The number 
of typical sequences $x^n$ that have a given typical pattern is 
then upper bounded by the product of factorials that leads 
to the second term of the bound. It is then shown that the 
contribution of non-typical sequences to the probability of any 
typical pattern decays exponentially in $n^{a\varepsilon}$, where $a$ is some 
constant. It is necessary to show that even if a typical pattern 
is the pattern of very few typical sequences, the many non-typical 
sequences of this pattern still contribute negligibly to 
its probability. To show that, each set of non-typical sequences 
that have pattern $\psi^n$ is shown to result from a permutation 
of a typical sequence, where the probability of such a non-typical 
permutation multiplied by a bound on the number of 
such permutations is still negligible w.r.t. the probability of 
the original typical sequence. Finally, a straightforward set 
of equations that breaks the pattern entropy computation into 
typical and non-typical sequences yields the bound of (15). 




Fig. 1: Region of decrease from i.i.d. to pattern entropy as 
function of $k$ for $n = 10^6$ bits with $\varepsilon = 0.1$. 

C. Entropy Range 

We now consider the overall range of values the pattern 
entropy can take, regardless of how the letter probabilities 
are lined up in the probability space. It is clear that the 
lower bound in (10) is tight for a uniform distribution with 
$\theta_i > 1/n^{1-\varepsilon}$. The upper bound, however, is restricted by the 
minimum number of permutations that yield a typical sequence 
after permuting another typical sequence. For the simple bound 
in (10), only the identity permutation is counted. However, if 
the number of alphabet symbols is sufficiently large, there must 
be more than one such permutation, because more than one 
letter probability must fall within a single bin of $\boldsymbol{\eta}$. Letters with 
probabilities in the same bin in a typical $x^n$ can be permuted 
among themselves to another sequence $y^n$ that is typical, gives 
the same pattern, and has almost equal probability to $x^n$. 
In order not to violate the condition $\sum \theta_i = 1$, most of the letter 
probabilities must be distributed in essentially $O(n^{(1+\varepsilon)/3})$ 
lower bins of $\boldsymbol{\eta}$. For sufficiently large alphabets, using the 
smallest possible number of such permutations yields 

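Theorem 4: Let $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, $1 \le i \le k$, and let $k > n^{(1+\varepsilon)/3}$. Then, 

$$nH_\theta(X) - \log(k!) \le H_\theta(\Psi^n) \le nH_\theta(X) - (1 - \varepsilon)\,\frac{3}{2}\,k \log \frac{k}{\cdots}. \tag{16}$$
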
Theorem 4 gives a range within which the pattern entropy 
must be, depending on the actual letter probabilities. Figure 1 
shows the region of decrease in the pattern entropy w.r.t. the 
i.i.d. one. For large alphabets, the entropy must decrease 
essentially by at least $1.5 \log(k/n^{1/3})$ bits per alphabet symbol. 

V. Bounds for Very Large Alphabets 

We now consider a more general case, where there is no 
lower bound on the letter probabilities. 

A. An Upper Bound 

A general upper bound on $H_\theta(\Psi^n)$ is derived through 
a sequential probability assignment code. A new symbol is 
assigned a joint probability of its index and its bin in the grid 
$\boldsymbol{\eta}$. We thus code the joint sequence $(\psi^n, \beta^n)$, where $\beta^n$ is 
the sequence of bin indices corresponding to $x^n$. The average 
description length of this code upper bounds the joint entropy 
$H_\theta(\Psi^n, B^n)$, which in turn upper bounds $H_\theta(\Psi^n)$. 

The probability that is assigned to the joint pattern 
and bin sequence is given by $Q[(\psi^n, \beta^n)] = \prod_{j=1}^{n} Q[\psi_j, \beta_j \mid (\psi^{j-1}, \beta^{j-1})]$. If $\psi_j$ is an index that 
already occurred in the pattern $\psi^{j-1}$, then 

$$Q[\psi_j, \beta_j \mid (\psi^{j-1}, \beta^{j-1})] = \rho_{\beta_j}, \tag{17}$$

where $\rho_b = \varphi_b / k_b$ for $b \ge 2$, and $\rho_0$ and $\rho_1$ are values assigned 
to letters in the first two bins, which will be optimized later. 
Once an index has occurred, it can only recur jointly with the 
bin number of its first occurrence. If $\psi_j$ is a 
new index, and its bin is $\beta_j$, the pair is assigned probability 

$$Q[\psi_j, \beta_j \mid (\psi^{j-1}, \beta^{j-1})] = \varphi_{\beta_j} - c[(\psi^{j-1}, \beta^{j-1}), \beta_j] \cdot \rho_{\beta_j}, \tag{18}$$

where $c[(\psi^{j-1}, \beta^{j-1}), \beta_j]$ is the number of distinct indices 
that jointly occurred with bin index $\beta_j$ in $(\psi^{j-1}, \beta^{j-1})$ (e.g., if 
$\psi^{j-1} = 1232345$ and $\beta^{j-1} = 1222242$, then $c[(\psi^7, \beta^7), \beta_j]$ 
is 3 for $\beta_j = 2$, 1 for $\beta_j = 1$ and $\beta_j = 4$, and is 0 otherwise). 

This probability assignment groups the probability of all 
the symbols in the same bin into one symbol. Then, at each 
occurrence of a new symbol in bin $b$, it codes a new index 
with the remaining group probability, extracting one count of 
the mean bin probability from the remaining probability in the 
bin. Each re-occurrence of an index assigns the index and its 
attached bin the mean bin probability of the respective bin. For 
bins $b = 0, 1$, the mean is replaced by $\rho_0$ and $\rho_1$, respectively. 
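
The assignment described above can be sketched as follows. The sketch assumes that a repeated index in bin $b$ receives probability $\rho_b$, and that a new index in bin $b$ receives $\varphi_b - c \cdot \rho_b$, where $c$ counts the distinct indices already seen in that bin. The function names, the dictionaries `phi` and `rho`, and the numerical values are hypothetical; the values of $\rho_0$ and $\rho_1$ are simply supplied by the caller rather than optimized.

```python
def distinct_indices_per_bin(psi, beta):
    """Count, for every bin, the distinct pattern indices that occurred with it."""
    seen = {}
    for idx, b in zip(psi, beta):
        seen.setdefault(b, set()).add(idx)
    return {b: len(indices) for b, indices in seen.items()}


def joint_probability(psi, beta, phi, rho):
    """Probability Q assigned to the joint pattern and bin sequence.

    phi[b] is the total probability of bin b, rho[b] the per-letter
    probability used for bin b (phi[b] / k_b for b >= 2 in the text).
    """
    first_bin = {}   # pattern index -> bin of its first occurrence
    count = {}       # bin -> number of distinct indices seen so far
    q = 1.0
    for idx, b in zip(psi, beta):
        if idx in first_bin:
            if first_bin[idx] != b:
                return 0.0          # an index never changes its bin
            q *= rho[b]             # repeated index
        else:
            q *= phi[b] - count.get(b, 0) * rho[b]   # new index
            first_bin[idx] = b
            count[b] = count.get(b, 0) + 1
    return q


if __name__ == "__main__":
    print(distinct_indices_per_bin((1, 2, 3, 2, 3, 4, 5), (1, 2, 2, 2, 2, 4, 2)))
    phi = {2: 0.6, 3: 0.4}
    rho = {2: 0.2, 3: 0.4}   # three letters in bin 2, one letter in bin 3
    print(joint_probability((1, 2, 1, 3), (2, 2, 2, 3), phi, rho))
```

The first call reproduces the counting example given in the text; the second multiplies the factors 0.6, 0.4, 0.2 and 0.4 for a two-bin toy source.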

Upper bounding the average description length of this code, 
and optimizing $\rho_0$ and $\rho_1$ to minimize the bound, yields the 
following upper bound on the pattern entropy. 

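Theorem 5: The pattern entropy is upper bounded by 

$$H_\theta(\Psi^n) \le nH_\theta^{(0,1)}(X) - \sum_{b=2}^{B} (1 - \varepsilon) \log(k_b!) + (n\varphi_1 - L_1) \log[\min\{k_1, n\}] + n\varphi_1 h_2\!\left(\frac{L_1}{n\varphi_1}\right) + \cdots \tag{19}$$

where $h_2(\alpha) = -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha)$. 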
The bound consists of: the packed i.i.d. entropy with bins 0 
and 1 as one symbol each (the first term), the pattern gain in 
first occurrences of any letter within the remaining bins (the 
second term), the loss in packing bin $b = 1$ (the next two 
terms), and the loss in packing bin $b = 0$ (the last term). 
The greatest contribution of the third and the fourth term 
can be shown to be $(1 - \varepsilon) n \varphi_1 \log n$, and that of the last 
term $0.5 \varphi_0 n^{1-\varepsilon} \log(2e n^{1+\varepsilon})$, which is clearly negligible if 
$H_\theta^{(0,1)}(X)$ is non-vanishing. 



B. A Lower Bound 

To lower bound $H_\theta(\Psi^n)$, the contributions of large and 
small probabilities are separated. The former, of probabilities 
greater than $1/n^{1-\varepsilon}$, is bounded using a derivation as in 
Theorem 3. The latter is bounded by a straightforward derivation. 
To separate the two, we define a random sequence $Z^n$, such 
that $Z_j = 0$ if $\theta_{X_j} \le 1/n^{1-\varepsilon}$ and 1 otherwise. Using $Z^n$, 
$H_\theta(\Psi^n)$ can be expressed as 

$$H_\theta(\Psi^n) = H_\theta(\Psi^n \mid Z^n) + H_\theta(Z^n) - H_\theta(Z^n \mid \Psi^n). \tag{20}$$
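
The decomposition in (20) is an instance of the chain rule for entropy applied to the pair $(\Psi^n, Z^n)$ in both orders. A short derivation, using only standard identities, is:

```latex
\begin{align*}
H_\theta(\Psi^n, Z^n) &= H_\theta(Z^n) + H_\theta(\Psi^n \mid Z^n) \\
                      &= H_\theta(\Psi^n) + H_\theta(Z^n \mid \Psi^n), \\
\text{hence}\quad
H_\theta(\Psi^n) &= H_\theta(\Psi^n \mid Z^n) + H_\theta(Z^n) - H_\theta(Z^n \mid \Psi^n).
\end{align*}
```

The last term does not vanish in general, because the pattern alone does not reveal which occurrences came from low probability letters.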

The first term of (20) can now be bounded by splitting 
a particular value $z^n$ of $Z^n$ into the elements for which 
$z_j = 1$ and those for which $z_j = 0$, and bounding 
$H_\theta(\Psi^n \mid z^n)$ separately for each of these sets. We use 
the relation $H_\theta(\Psi^n \mid z^n) = \sum_{j=1}^{n} H_\theta(\Psi_j \mid \Psi^{j-1}, z^n) \ge \sum_{j=1}^{n} H_\theta(\Psi_j \mid X^{j-1}, z^n)$. The third term of (20) 
complicates the analysis if there is no clear separation between small 
and large probabilities, i.e., there exist $\varepsilon$ values for which 
some letters have probabilities in $(1/(2n^{1-\varepsilon}), 1/n^{1-\varepsilon}]$ 
and other letters have probabilities in $(1/n^{1-\varepsilon}, 3/(2n^{1-\varepsilon})]$. 
A permutation between letters in the first bin and letters in 
the second may still result in a typical sequence. Hence, the 
separation must yield a correction term. Applying all the above 
considerations yields the following lower bound: 

Theorem 6: The pattern entropy is lower bounded by 

$$H_\theta(\Psi^n) \ge \cdots \tag{21}$$

The first term in (21) is the i.i.d. entropy in which all letters 
with probability not greater than $1/n^{1-\varepsilon}$ are packed into one 
symbol. The second term is the decrease in entropy due to first 
occurrences of large probability letters. The next three terms 
are due to the contribution of low probability letters beyond 
that of the super-symbol that merges them. The first two of 
these represent the penalty in packing in repetition of these 
letters, where the third one is the penalty in first occurrence 
of such a letter. We note that the first two of these three terms 
can be separated into contributions of the first $k_0$ letters and 
the following $k_1$ letters, to obtain expressions that resemble the 
bound in (19). The sixth term of (21) is the correction term 
from separating small and large probability letters. Finally, 
the last term of $o(1)$ absorbs all the lower order terms of the 
bound. As in the upper bound in (19), all the terms beyond the 
first two and the first element of the third term can be shown 
to contribute at most $O(n\varphi_1 \log n)$ for letters that result from 
bin 1 of $\boldsymbol{\eta}$, and $o(n)$ for letters that result from bin 0 of $\boldsymbol{\eta}$. 

There are several other forms that the bound in (21) can 
be brought to. In particular, the second term, representing the 
decrease due to first occurrences of large probability letters, 
may not be tight if the distribution is close to uniform, but 
symbols appear in very few separate adjacent bins formed by 
$\boldsymbol{\xi}$. If this is the case, it may be beneficial to derive a bound 
on the large probabilities using similar methods to the bound 
derived on the low probabilities. In such a bound, the second 
term of (21) will be replaced by two terms that take the form 
of the third and fifth terms of (21), where the $n\theta_i$ leading 
element of the third term is omitted. 

VI. Summary and Conclusions 

We studied the entropy of patterns of i.i.d. sequences. We 
provided upper and lower bounds on this entropy as functions 
of a related i.i.d. source entropy, the alphabet size, the letter 
probabilities, and their arrangement in the probability space. 
The bounds provided a range of values the pattern entropy 
can take, and showed that in many cases it must decrease 
substantially from the original i.i.d. sequence entropy. It was 
shown that low probability symbols contribute mostly as a 
single super-symbol to the pattern entropy, where in particular, 
very low probability symbols contribute negligibly over the 
contribution of this super-symbol. 